Investigation of the Relationship Between the Amount of Calories and the Rating of Recipes
Name(s): Baoyi Wan & Qinyuan Gu
Website Link: https://github.com/clytzechoo/DSC80_Final_Project/edit/main/README.md
import pandas as pd
import numpy as np
from scipy import stats
import plotly.figure_factory as ff
import plotly.express as px
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
import ast
pd.options.plotting.backend = 'plotly'
# from dsc80_utils import * # Feel free to uncomment and use this.
#!pip install tabulate
Step 1: Introduction
Food is an essential part of our daily lives, and cooking is a hobby that brings joy and a sense of accomplishment to many. While many people are fond of fatty foods for their taste and the satisfaction they bring, such foods, particularly those rich in saturated fat, contain more calories, and the associated health risks, such as cardiovascular disease, diabetes, and obesity, cannot be ignored. A Harvard University study published in JAMA Internal Medicine in 2016 linked a long-term diet high in saturated fat to a 2-fold increase in the risk of coronary artery calcification. Another study, in Diabetes Care in 2019, found that consuming more than 1,000 calories per day for five days decreased insulin sensitivity by 27%. With this knowledge in mind, we aim to explore the relationship between a recipe's rating and its calorie and saturated fat content. Our goal is to predict the average rating of a recipe and to understand its connection with the calories and saturated fat in recipes. To achieve this, we analyze two datasets of recipes and ratings posted on food.com since 2008.
The first dataset, recipes, contains 83,782 rows, one per unique recipe, and 10 columns.
The second dataset, interactions, contains 731,927 rows; each row is a user's review of a specific recipe.
We are investigating whether people rate high-calorie recipes and common-calorie recipes on the same scale. To facilitate this, we separated the values in the 'nutrition' column into corresponding columns: 'calories (#)', 'total fat (PDV)', 'sugar (PDV)', 'saturated fat (PDV)', and so on. PDV, or percent daily value, shows how much a nutrient in a serving of food contributes to a total daily diet. Moreover, we calculated the proportion of a recipe's total calories that come from saturated fat and stored it in a new column, 'prop saturated fat'; "high saturated fat" here refers to recipes whose proportion of saturated fat exceeds 10% of total calories. The columns most relevant to our question are 'calories (#)', 'saturated fat (PDV)', and 'prop saturated fat', described above, along with 'rating', the rating a user gave a recipe, and 'avg_rating', the average of the ratings for each unique recipe.
Our research could lead to future work that digs deeper into how much awareness people have of the negative health effects of high-calorie recipes.
# Load data
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('interactions.csv')
Step 2: Data Cleaning and Exploratory Data Analysis
We conducted the following data cleaning steps:
Left-merge the 'recipes' and 'interactions' datasets.
Replace ratings of 0 with np.nan.
Compute the average rating for each recipe and add 'avg_rating' to the recipes dataset.
Parse the list-like string in the 'nutrition' column.
Separate the values needed for our research into their own series: calories, total fat, protein, saturated fat, and carbohydrates.
We applied a lambda function and then converted the columns to floats.
Add 'prop saturated fat' to the dataframe.
'prop saturated fat' is the proportion of a recipe's total calories that come from saturated fat. We divide the values in the saturated fat (PDV) column by 100 to convert them to decimal form, then multiply by 20 to convert to grams of saturated fat, since 20 grams of saturated fat is the 100% daily value (PDV). We arrived at this 20-gram figure by experimenting on food.com with different amounts of saturated fat in a recipe, which let us infer the nutrition formula the website uses. Lastly, we multiply by 9, since there are 9 calories in 1 gram of fat. After these conversions we have the number of calories from saturated fat, which we divide by the recipe's total calories to get the proportion. Conveniently, the resulting values lie between 0 and 1.
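To make the conversion concrete, here is a minimal sketch of the calculation as a standalone function (the 20 g daily value and 9 kcal/g figures are the assumptions stated above; the function name is ours):

```python
def prop_saturated_fat(sat_fat_pdv, calories):
    """Proportion of a recipe's total calories that come from saturated fat."""
    grams = sat_fat_pdv / 100 * 20   # PDV -> grams (assumes 20 g = 100% daily value)
    fat_calories = grams * 9         # grams -> calories (9 kcal per gram of fat)
    return fat_calories / calories

# e.g. the brownie recipe below: 19 PDV saturated fat, 138.4 total calories
print(round(prop_saturated_fat(19.0, 138.4), 6))  # → 0.24711
```

This matches the 'prop sacurated fat' value computed for that recipe in the dataframe shown below.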
Add 'is_high_saturated_fat' to the dataframe.
'is_high_saturated_fat' is a boolean column indicating whether the proportion of saturated fat is larger than 10% of total calories. This step separates the recipes into two groups: high saturated fat and not.
Add 'is_high_calorie' to the dataframe.
'is_high_calorie' is a boolean column indicating whether a recipe has more than 600 calories. This step separates the recipes into two groups: high calorie and not.
Add 'is_vegan' to the dataframe.
'is_vegan' is a boolean column indicating whether 'vegan' appears in a recipe's tags. This step separates the recipes into two groups: vegan and not.
Add 'is_quick' to the dataframe.
'is_quick' is a boolean column indicating whether 'quick' appears in a recipe's tags. This step separates the recipes into two groups: quick and not.
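One caveat on the tag flags: since 'tags' is stored as a list-like string, `str.contains('quick')` matches any tag containing the substring (e.g. 'quick-breads'), not only the exact tag 'quick'. A small sketch of the difference, using made-up tag strings, with an exact-match alternative via `ast.literal_eval`:

```python
import ast
import pandas as pd

tags = pd.Series([
    "['30-minutes-or-less', 'vegan', 'quick-breads']",
    "['weeknight', 'quick']",
])

# substring check (as used in our cleaning code) flags both rows for 'quick'
substring_quick = tags.str.contains('quick', na=False)

# exact-tag check parses the list-like string first
exact_quick = tags.apply(lambda s: 'quick' in ast.literal_eval(s))

print(substring_quick.tolist())  # [True, True]
print(exact_quick.tolist())      # [False, True]
```

We keep the substring version below for simplicity, treating "quick-ish" tags as quick.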
# Data Cleaning
# left merge two datasets
result = pd.merge(recipes, interactions, left_on='id', right_on='recipe_id', how='left')
# replace 0 rating to np.nan
result['rating'] = result['rating'].replace(0, np.nan)
# get average rating for each recipe
avg_ratings = result.groupby('id')['rating'].mean().rename('avg_rating')
# add avg_rating to recipes dataset
recipes = recipes.merge(avg_ratings, left_on='id', right_index=True, how='left')
# retrieve data from nutrition list like string
recipes['nutrition'] = recipes['nutrition'].apply(ast.literal_eval)
# values needed for research: calories, total fat, protein, saturated fat, carbohydrates
recipes['calories'] = recipes['nutrition'].apply(lambda x: x[0] if isinstance(x, list) and x[0] > 0 else np.nan)
recipes['total fat'] = recipes['nutrition'].apply(lambda x: x[1] if isinstance(x, list) and x[1] > 0 else np.nan)
recipes['protein'] = recipes['nutrition'].apply(lambda x: x[4] if isinstance(x, list) and x[4] > 0 else np.nan)
recipes['sacurated fat'] = recipes['nutrition'].apply(lambda x: x[5] if isinstance(x, list) and x[5] > 0 else np.nan)
recipes['carbs'] = recipes['nutrition'].apply(lambda x: x[6] if isinstance(x, list) and x[6] > 0 else np.nan)
recipes['prop sacurated fat'] = recipes['nutrition'].apply(lambda x: x[5] / 100 * 20 * 9 / x[0] if isinstance(x, list)
                                                           and x[5] > 0 else np.nan)
recipes['is_high_calorie'] = recipes['nutrition'].apply(lambda x: isinstance(x, list) and x[0] > 600)
recipes['is_high_sacurated_fat'] = recipes['prop sacurated fat'] > 0.1
recipes['is_vegan'] = recipes['tags'].str.contains('vegan', na=False)
recipes['is_quick'] = recipes['tags'].str.contains('quick', na=False)
recipes
| name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ... | calories | total fat | protein | sacurated fat | carbs | prop sacurated fat | is_high_calorie | is_high_sacurated_fat | is_vegan | is_quick | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 brownies in the world best ever | 333281 | 40 | 985201 | 2008-10-27 | ['60-minutes-or-less', 'time-to-make', 'course... | [138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0] | 10 | ['heat the oven to 350f and arrange the rack i... | these are the most; chocolatey, moist, rich, d... | ... | 138.4 | 10.0 | 3.0 | 19.0 | 6.0 | 0.247110 | False | True | False | False |
| 1 | 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | ['60-minutes-or-less', 'time-to-make', 'cuisin... | [595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0] | 12 | ['pre-heat oven the 350 degrees f', 'in a mixi... | this is the recipe that we use at my school ca... | ... | 595.1 | 46.0 | 13.0 | 51.0 | 26.0 | 0.154260 | False | True | False | False |
| 2 | 412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | ['60-minutes-or-less', 'time-to-make', 'course... | [194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0] | 6 | ['preheat oven to 350 degrees', 'spray a 2 qua... | since there are already 411 recipes for brocco... | ... | 194.8 | 20.0 | 22.0 | 36.0 | 3.0 | 0.332649 | False | True | False | False |
| 3 | millionaire pound cake | 286009 | 120 | 461724 | 2008-02-12 | ['time-to-make', 'course', 'cuisine', 'prepara... | [878.3, 63.0, 326.0, 13.0, 20.0, 123.0, 39.0] | 7 | ['freheat the oven to 300 degrees', 'grease a ... | why a millionaire pound cake? because it's su... | ... | 878.3 | 63.0 | 20.0 | 123.0 | 39.0 | 0.252078 | True | True | False | False |
| 4 | 2000 meatloaf | 475785 | 90 | 2202916 | 2012-03-06 | ['time-to-make', 'course', 'main-ingredient', ... | [267.0, 30.0, 12.0, 12.0, 29.0, 48.0, 2.0] | 17 | ['pan fry bacon , and set aside on a paper tow... | ready, set, cook! special edition contest entr... | ... | 267.0 | 30.0 | 29.0 | 48.0 | 2.0 | 0.323596 | False | True | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83777 | zydeco soup | 486161 | 60 | 227978 | 2012-08-29 | ['ham', '60-minutes-or-less', 'time-to-make', ... | [415.2, 26.0, 34.0, 26.0, 44.0, 21.0, 15.0] | 7 | ['heat oil in a 4-quart dutch oven', 'add cele... | this is a delicious soup that i originally fou... | ... | 415.2 | 26.0 | 44.0 | 21.0 | 15.0 | 0.091040 | False | False | False | False |
| 83778 | zydeco spice mix | 493372 | 5 | 1500678 | 2013-01-09 | ['15-minutes-or-less', 'time-to-make', 'course... | [14.8, 0.0, 2.0, 58.0, 1.0, 0.0, 1.0] | 1 | ['mix all ingredients together thoroughly'] | this spice mix will make your taste buds dance! | ... | 14.8 | NaN | 1.0 | NaN | 1.0 | NaN | False | False | True | False |
| 83779 | zydeco ya ya deviled eggs | 308080 | 40 | 37779 | 2008-06-07 | ['60-minutes-or-less', 'time-to-make', 'course... | [59.2, 6.0, 2.0, 3.0, 6.0, 5.0, 0.0] | 7 | ['in a bowl , combine the mashed yolks and may... | deviled eggs, cajun-style | ... | 59.2 | 6.0 | 6.0 | 5.0 | NaN | 0.152027 | False | True | False | False |
| 83780 | cookies by design cookies on a stick | 298512 | 29 | 506822 | 2008-04-15 | ['30-minutes-or-less', 'time-to-make', 'course... | [188.0, 11.0, 57.0, 11.0, 7.0, 21.0, 9.0] | 9 | ['place melted butter in a large mixing bowl a... | i've heard of the 'cookies by design' company,... | ... | 188.0 | 11.0 | 7.0 | 21.0 | 9.0 | 0.201064 | False | True | False | False |
| 83781 | cookies by design sugar shortbread cookies | 298509 | 20 | 506822 | 2008-04-15 | ['30-minutes-or-less', 'time-to-make', 'course... | [174.9, 14.0, 33.0, 4.0, 4.0, 11.0, 6.0] | 5 | ['whip sugar and shortening in a large bowl , ... | i've heard of the 'cookies by design' company,... | ... | 174.9 | 14.0 | 4.0 | 11.0 | 6.0 | 0.113208 | False | True | False | False |
83782 rows × 23 columns
Univariate Analysis
In this analysis, we first examined the distribution of calories in recipes. The histogram (fig1) decreases rapidly to the right, indicating that most recipes in the dataframe are relatively low in calories and that recipes become rarer as their calorie count increases.
We then examined the distribution of average rating across recipes with different calorie counts. The scatter plot (fig2) shows that most ratings cluster around 3 to 5 stars, suggesting that people tend to give relatively high ratings; interestingly, ratings do not appear to decline as calories increase.
We also examined the distribution of the proportion of saturated fat in recipes. The histogram (fig3) shows that recipes with 5%-15% saturated fat are the most common, and the number of recipes drops rapidly as the proportion of saturated fat increases.
Finally, we examined the proportion of saturated fat against average rating (fig4). The result is similar to fig2, but more concentrated between 4 and 5 stars.
# Univariate Analysis
# fig1: Distribution of Calories in Recipes
fig1 = px.histogram(recipes, x='calories', nbins=50, title='Distribution of Calories in Recipes')
fig1.update_layout(xaxis_title='Calories', yaxis_title='Count')
fig1.show()
# fig2: Distribution of rating in recipes
interactions['rating'] = interactions['rating'].replace(0, np.nan)
avg_ratings = interactions.groupby('recipe_id')['rating'].mean().reset_index()
avg_ratings.columns = ['id', 'avg_rating']
recipes_merged = pd.merge(recipes, avg_ratings, on='id')
fig2 = px.scatter(recipes, x='calories', y='avg_rating',
title='Calories vs. Average Recipe Rating',
labels={'calories': 'Calories', 'avg_rating': 'Average Rating'},
opacity=0.5)
fig2.show()
# fig3: Distribution of Proportion of Saturated Fat in Recipes
fig3 = px.histogram(recipes, x='prop sacurated fat', nbins=10, title='Distribution of Proportion of '
                                                                     'Saturated Fat in Recipes')
fig3.update_layout(xaxis_title='Proportion of Saturated Fat in Recipe', yaxis_title='Count')
fig3.show()
# fig4: Proportion of Saturated Fat vs. Average Recipe Rating
fig4 = px.scatter(recipes, x='prop sacurated fat', y='avg_rating',
                  title='Proportion of Saturated Fat vs. Average Recipe Rating',
                  labels={'prop sacurated fat': 'Proportion of Saturated Fat', 'avg_rating': 'Average Rating'},
opacity=0.5)
fig4.show()
# fig5: Average Rating by Number of Recipe Steps
grouped_table = recipes.groupby('n_steps')['avg_rating'].mean().reset_index().sort_values('n_steps')
print(grouped_table.head(10).to_markdown(index=False))
fig5 = px.bar(grouped_table, x='n_steps', y='avg_rating',
title='Average Rating by Number of Recipe Steps',
labels={'n_steps': 'Number of Steps', 'avg_rating': 'Average Rating'})
fig5.show()
| n_steps | avg_rating |
|--------:|-----------:|
| 1 | 4.64813 |
| 2 | 4.66612 |
| 3 | 4.65546 |
| 4 | 4.64004 |
| 5 | 4.61038 |
| 6 | 4.61137 |
| 7 | 4.62251 |
| 8 | 4.62609 |
| 9 | 4.61521 |
| 10 | 4.60389 |
Bivariate Analysis
In this analysis, we first split the dataframe into two groups: High Calorie (calories greater than 600) and Common Calorie. The comparison bar chart shows far fewer ratings for high-calorie recipes than for the common group, but in both groups the counts increase with star rating, with most ratings at 4-5 stars.
We then did the same for saturated fat, splitting the dataframe into a High Saturated Fat group and a Common group. The comparison bar chart shows that high-saturated-fat recipes receive more ratings than the common group, and again the counts increase with star rating, peaking at 4-5 stars. This tells us that high calorie does not equal high saturated fat; high-calorie recipes are a more varied group.
# Bivariate Analysis
# round average ratings to whole stars; keep NaN (dropna + astype would misalign indices on reassignment)
recipes['rating_rounded'] = recipes['avg_rating'].round().clip(1, 5)
df_plot1 = recipes.groupby(['is_high_calorie', 'rating_rounded']).size().reset_index(name='count')
df_plot2 = recipes.groupby(['is_high_sacurated_fat', 'rating_rounded']).size().reset_index(name='count')
fig = px.bar(
df_plot1,
x='rating_rounded',
y='count',
color='is_high_calorie',
barmode='group',
title='High Calories vs Common Calories',
labels={'rating_rounded': 'Rating', 'count': 'Count'},
    category_orders={'rating_rounded': [1, 2, 3, 4, 5],
                     'is_high_calorie': [True, False]},
)
fig.show()
fig = px.bar(
df_plot2,
x='count',
y='rating_rounded',
color='is_high_sacurated_fat',
barmode='group',
orientation='h',
    title='High Saturated Fat vs. Common Recipes',
    labels={'rating_rounded': 'Rating', 'count': 'Count'},
    category_orders={'rating_rounded': [1, 2, 3, 4, 5],
                     'is_high_sacurated_fat': [True, False]},
)
fig.show()
Step 3: Assessment of Missingness
recipes_merged
| name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ... | total fat | protein | sacurated fat | carbs | prop sacurated fat | is_high_calorie | is_high_sacurated_fat | is_vegan | is_quick | avg_rating_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 brownies in the world best ever | 333281 | 40 | 985201 | 2008-10-27 | ['60-minutes-or-less', 'time-to-make', 'course... | [138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0] | 10 | ['heat the oven to 350f and arrange the rack i... | these are the most; chocolatey, moist, rich, d... | ... | 10.0 | 3.0 | 19.0 | 6.0 | 0.247110 | False | True | False | False | 4.0 |
| 1 | 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | ['60-minutes-or-less', 'time-to-make', 'cuisin... | [595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0] | 12 | ['pre-heat oven the 350 degrees f', 'in a mixi... | this is the recipe that we use at my school ca... | ... | 46.0 | 13.0 | 51.0 | 26.0 | 0.154260 | False | True | False | False | 5.0 |
| 2 | 412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | ['60-minutes-or-less', 'time-to-make', 'course... | [194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0] | 6 | ['preheat oven to 350 degrees', 'spray a 2 qua... | since there are already 411 recipes for brocco... | ... | 20.0 | 22.0 | 36.0 | 3.0 | 0.332649 | False | True | False | False | 5.0 |
| 3 | millionaire pound cake | 286009 | 120 | 461724 | 2008-02-12 | ['time-to-make', 'course', 'cuisine', 'prepara... | [878.3, 63.0, 326.0, 13.0, 20.0, 123.0, 39.0] | 7 | ['freheat the oven to 300 degrees', 'grease a ... | why a millionaire pound cake? because it's su... | ... | 63.0 | 20.0 | 123.0 | 39.0 | 0.252078 | True | True | False | False | 5.0 |
| 4 | 2000 meatloaf | 475785 | 90 | 2202916 | 2012-03-06 | ['time-to-make', 'course', 'main-ingredient', ... | [267.0, 30.0, 12.0, 12.0, 29.0, 48.0, 2.0] | 17 | ['pan fry bacon , and set aside on a paper tow... | ready, set, cook! special edition contest entr... | ... | 30.0 | 29.0 | 48.0 | 2.0 | 0.323596 | False | True | False | False | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83776 | zydeco soup | 486161 | 60 | 227978 | 2012-08-29 | ['ham', '60-minutes-or-less', 'time-to-make', ... | [415.2, 26.0, 34.0, 26.0, 44.0, 21.0, 15.0] | 7 | ['heat oil in a 4-quart dutch oven', 'add cele... | this is a delicious soup that i originally fou... | ... | 26.0 | 44.0 | 21.0 | 15.0 | 0.091040 | False | False | False | False | 5.0 |
| 83777 | zydeco spice mix | 493372 | 5 | 1500678 | 2013-01-09 | ['15-minutes-or-less', 'time-to-make', 'course... | [14.8, 0.0, 2.0, 58.0, 1.0, 0.0, 1.0] | 1 | ['mix all ingredients together thoroughly'] | this spice mix will make your taste buds dance! | ... | NaN | 1.0 | NaN | 1.0 | NaN | False | False | True | False | 5.0 |
| 83778 | zydeco ya ya deviled eggs | 308080 | 40 | 37779 | 2008-06-07 | ['60-minutes-or-less', 'time-to-make', 'course... | [59.2, 6.0, 2.0, 3.0, 6.0, 5.0, 0.0] | 7 | ['in a bowl , combine the mashed yolks and may... | deviled eggs, cajun-style | ... | 6.0 | 6.0 | 5.0 | NaN | 0.152027 | False | True | False | False | 5.0 |
| 83779 | cookies by design cookies on a stick | 298512 | 29 | 506822 | 2008-04-15 | ['30-minutes-or-less', 'time-to-make', 'course... | [188.0, 11.0, 57.0, 11.0, 7.0, 21.0, 9.0] | 9 | ['place melted butter in a large mixing bowl a... | i've heard of the 'cookies by design' company,... | ... | 11.0 | 7.0 | 21.0 | 9.0 | 0.201064 | False | True | False | False | 1.0 |
| 83780 | cookies by design sugar shortbread cookies | 298509 | 20 | 506822 | 2008-04-15 | ['30-minutes-or-less', 'time-to-make', 'course... | [174.9, 14.0, 33.0, 4.0, 4.0, 11.0, 6.0] | 5 | ['whip sugar and shortening in a large bowl , ... | i've heard of the 'cookies by design' company,... | ... | 14.0 | 4.0 | 11.0 | 6.0 | 0.113208 | False | True | False | False | 3.0 |
83781 rows × 24 columns
grouped_table
| n_steps | avg_rating | |
|---|---|---|
| 0 | 1 | 4.648132 |
| 1 | 2 | 4.666122 |
| 2 | 3 | 4.655457 |
| 3 | 4 | 4.640044 |
| 4 | 5 | 4.610383 |
| ... | ... | ... |
| 79 | 87 | 5.000000 |
| 80 | 88 | 3.666667 |
| 81 | 93 | 5.000000 |
| 82 | 98 | 5.000000 |
| 83 | 100 | 5.000000 |
84 rows × 2 columns
Step 4: Hypothesis Testing
df = recipes_merged.copy()
# the earlier merge produced suffixed avg_rating_x/avg_rating_y columns; recover a single avg_rating
df['avg_rating'] = df['avg_rating_y']
df['calories'] = df['nutrition'].str[0].to_numpy()
threshold = df['calories'].median()
df['high_calorie'] = df['calories'] > threshold
high_ratings = df[df['high_calorie']]['avg_rating'].dropna()
low_ratings = df[~df['high_calorie']]['avg_rating'].dropna()
observed_difference = high_ratings.mean() - low_ratings.mean()
n_permutations = 1000
permutation_differences = []
for _ in range(n_permutations):
shuffled_ratings = df['avg_rating'].sample(frac=1, replace=False).reset_index(drop=True)
df['shuffled_rating'] = shuffled_ratings
high_group = df[df['high_calorie']]['shuffled_rating']
low_group = df[~df['high_calorie']]['shuffled_rating']
permutation_difference = high_group.mean() - low_group.mean()
permutation_differences.append(permutation_difference)
p_value = np.mean(np.array(permutation_differences) >= observed_difference)
print(f"Observed difference in means: {observed_difference:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
print("Reject H₀: High-calorie recipes have significantly higher ratings.")
else:
print("Fail to reject H₀: No significant evidence that high-calorie recipes have higher ratings.")
Observed difference in means: -0.0082 P-value: 0.9720 Fail to reject H₀: No significant evidence that high-calorie recipes have higher ratings.
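The loop above rebuilds a shuffled column on the dataframe each iteration; an equivalent (and somewhat faster) sketch shuffles the boolean group labels directly with NumPy. The helper name and the toy data here are ours, not part of the project:

```python
import numpy as np

rng = np.random.default_rng(42)

def permutation_test(ratings, is_high, n_permutations=1000):
    """One-sided permutation test for mean(high) - mean(low).

    Shuffling the boolean group labels is equivalent to shuffling
    the ratings themselves. Drop NaN ratings before calling.
    """
    observed = ratings[is_high].mean() - ratings[~is_high].mean()
    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(is_high)  # shuffle labels, not values
        diffs[i] = ratings[shuffled].mean() - ratings[~shuffled].mean()
    p_value = np.mean(diffs >= observed)
    return observed, p_value

# toy check: identical groups should yield an observed difference of 0
# and a large p-value
ratings = np.array([4.0, 5.0, 3.0, 4.0, 5.0, 3.0])
is_high = np.array([True, True, True, False, False, False])
obs, p = permutation_test(ratings, is_high)
```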
import matplotlib.pyplot as plt
plt.hist(permutation_differences, bins=30, edgecolor='black')
plt.axvline(observed_difference, color='red', linestyle='dashed', linewidth=2)
plt.title("Permutation Distribution of Difference in Means")
plt.xlabel("Difference in Mean Rating (High - Low)")
plt.ylabel("Frequency")
plt.show()
Step 5: Framing a Prediction Problem
import numpy as np
import pandas as pd
import ast
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
# --- (A) Data loading & basic feature extraction ---
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('interactions.csv')
# left merge two datasets
# Replace 0 ratings with NaN, then compute average rating per recipe
interactions['rating'] = interactions['rating'].replace(0, np.nan)
avg_ratings = (
interactions
.groupby('recipe_id')['rating']
.mean()
.reset_index()
.rename(columns={'recipe_id': 'id', 'rating': 'avg_rating'})
)
recipes = recipes.merge(avg_ratings, on='id', how='inner')
# Boolean flags
recipes['is_vegan'] = recipes['tags'].str.contains('vegan', na=False)
recipes['is_quick'] = recipes['tags'].str.contains('quick', na=False)
# Parse nutrition list; positions: [0]=calories, [1]=total fat, [4]=protein, [5]=saturated fat, [6]=carbohydrates
recipes['nutrition'] = recipes['nutrition'].apply(ast.literal_eval)
recipes['calories'] = recipes['nutrition'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else np.nan)
recipes['total fat'] = recipes['nutrition'].apply(lambda x: x[1] if isinstance(x, list) and len(x) > 1 else np.nan)
recipes['protein'] = recipes['nutrition'].apply(lambda x: x[4] if isinstance(x, list) and len(x) > 4 else np.nan)
recipes['carbs'] = recipes['nutrition'].apply(lambda x: x[6] if isinstance(x, list) and len(x) > 6 else np.nan)
# (Optional) If you want to vectorize ingredients:
# recipes['ingredients_joined'] = recipes['ingredients'].apply(lambda L: " ".join(L) if isinstance(L, list) else "")
model_data = recipes[
['minutes', 'n_steps', 'is_vegan', 'is_quick',
'calories', 'total fat', 'protein', 'carbs',
'avg_rating']
].copy()
model_data = model_data.dropna(subset=['avg_rating'])
X = model_data.drop(columns=['avg_rating'])
y = model_data['avg_rating']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42
)
# --- (B) Build preprocessing pipelines ---
numeric_features = ['minutes', 'n_steps', 'calories', 'total fat', 'protein', 'carbs']
bool_features = ['is_quick']
# 1) Numeric pipeline: median imputation → standard scaling
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
# 2) Boolean pipeline: cast to string so OneHotEncoder sees categories {"False","True"}
bool_to_str = FunctionTransformer(lambda X: X.astype(str), validate=False)
bool_transformer = Pipeline([
('bool2str', bool_to_str),
('onehot', OneHotEncoder(drop='if_binary', dtype=int, handle_unknown='ignore')),
])
# 3) (Optional) Ingredient pipeline (uncomment if you want to include many ingredients):
# from sklearn.feature_extraction.text import CountVectorizer
# ingredient_vectorizer = Pipeline([
# ('join', FunctionTransformer(lambda X: X.apply(lambda s: s if isinstance(s, str) else ""), validate=False)),
# ('cv', CountVectorizer(max_features=500, binary=True))
# ])
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("bool", bool_transformer, bool_features),
# ("ingr", ingredient_vectorizer, ["ingredients_joined"]),
],
remainder='drop' # everything else (if any) is dropped
)
# --- (C) Baseline: LinearRegression pipeline ---
baseline_pipe = Pipeline([
('preproc', preprocessor),
('regressor', LinearRegression())
])
baseline_pipe.fit(X_train, y_train)
baseline_score = baseline_pipe.score(X_test, y_test)
print(f"Baseline LinearRegression R²: {baseline_score:.4f}")
# --- (D) Final: Lasso with hyperparameter tuning ---
lasso_pipe = Pipeline([
('preproc', preprocessor),
('lasso', Lasso(max_iter=20_000))
])
param_grid = {
'lasso__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]
}
grid = GridSearchCV(
lasso_pipe,
param_grid=param_grid,
cv=5,
scoring='r2',
n_jobs=-1
)
grid.fit(X_train, y_train)
best_alpha = grid.best_params_['lasso__alpha']
final_score = grid.score(X_test, y_test)
print(f"Final Lasso R²: {final_score:.6f} (best α = {best_alpha})")
Baseline LinearRegression R²: -0.0003 Final Lasso R²: -0.000041 (best α = 0.001)
import numpy as np
import pandas as pd
import ast
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
# --- (A) Data loading & basic feature extraction ---
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('interactions.csv')
# left merge two datasets
result = pd.merge(recipes, interactions, left_on='id', right_on='recipe_id', how='left')
# replace 0 rating to np.nan
"""
Justification for replacing 0 ratings with NaN:
1. In typical rating systems, 0 often indicates missing data rather than a true zero rating
2. Most rating scales don't include 0 (common scales are 1-5 or 1-10)
3. Including 0s would artificially lower the average ratings
4. NaN values will be automatically excluded from average calculations
"""
result['rating'] = result['rating'].replace(0, np.nan)
# get average rating for each recipe
avg_ratings = result.groupby('id')['rating'].mean().rename('avg_rating')
# add avg_rating to recipes dataset
recipes = recipes.merge(avg_ratings, left_on='id', right_index=True, how='left')
# retrieve data from nutrition list like string
recipes['nutrition'] = recipes['nutrition'].apply(ast.literal_eval)
# values needed for research: calories, total fat, protein, saturated fat, carbohydrates
recipes['calories'] = recipes['nutrition'].apply(lambda x: x[0] if isinstance(x, list) and x[0] > 0 else np.nan)
recipes['total fat'] = recipes['nutrition'].apply(lambda x: x[1] if isinstance(x, list) and x[1] > 0 else np.nan)
recipes['protein'] = recipes['nutrition'].apply(lambda x: x[4] if isinstance(x, list) and x[4] > 0 else np.nan)
recipes['saturated fat'] = recipes['nutrition'].apply(lambda x: x[5] if isinstance(x, list) and x[5] > 0 else np.nan)
recipes['carbs'] = recipes['nutrition'].apply(lambda x: x[6] if isinstance(x, list) and x[6] > 0 else np.nan)
recipes['prop saturated fat'] = recipes['nutrition'].apply(lambda x: x[5] / 100 * 20 * 9 / x[0] if isinstance(x, list)
                                                           and x[5] > 0 else np.nan)
recipes['is_high_calorie'] = recipes['nutrition'].apply(lambda x: isinstance(x, list) and x[0] > 600)
recipes['is_high_saturated_fat'] = recipes['prop saturated fat'] > 0.1
recipes['is_vegan'] = recipes['tags'].str.contains('vegan', na=False)
recipes['is_quick'] = recipes['tags'].str.contains('quick', na=False)
model_data = recipes[
['minutes', 'n_steps', 'is_vegan', 'is_quick',
'calories', 'total fat', 'protein', 'carbs',
'avg_rating','saturated fat','prop saturated fat','is_high_calorie','is_high_saturated_fat']
].copy()
model_data = model_data.dropna(subset=['avg_rating'])
X = model_data.drop(columns=['avg_rating'])
y = model_data['avg_rating']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42
)
# --- (B) Build preprocessing pipelines ---
numeric_features = ['n_steps', 'protein', 'carbs', 'prop saturated fat', 'minutes', 'calories', 'total fat']
bool_features = ['is_vegan', 'is_quick', 'is_high_calorie', 'is_high_saturated_fat']
# 1) Numeric pipeline: median imputation → standard scaling
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
# 2) Boolean pipeline: cast to string so OneHotEncoder sees categories {"False","True"}
bool_to_str = FunctionTransformer(lambda X: X.astype(str), validate=False)
bool_transformer = Pipeline([
('bool2str', bool_to_str),
('onehot', OneHotEncoder(drop='if_binary', dtype=int, handle_unknown='ignore')),
])
# 3) (Optional) Ingredient pipeline (uncomment if you want to include many ingredients):
# from sklearn.feature_extraction.text import CountVectorizer
# ingredient_vectorizer = Pipeline([
# ('join', FunctionTransformer(lambda X: X.apply(lambda s: s if isinstance(s, str) else ""), validate=False)),
# ('cv', CountVectorizer(max_features=500, binary=True))
# ])
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("bool", bool_transformer, bool_features),
# ("ingr", ingredient_vectorizer, ["ingredients_joined"]),
],
remainder='drop' # everything else (if any) is dropped
)
# --- (C) Baseline: LinearRegression pipeline ---
baseline_pipe = Pipeline([
('preproc', preprocessor),
('regressor', LinearRegression())
])
baseline_pipe.fit(X_train, y_train)
baseline_score = baseline_pipe.score(X_test, y_test)
print(f"Baseline LinearRegression R²: {baseline_score:.4f}")
# --- (D) Final: Lasso with hyperparameter tuning ---
lasso_pipe = Pipeline([
('preproc', preprocessor),
('lasso', Lasso(max_iter=20_000))
])
param_grid = {
'lasso__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]
}
grid = GridSearchCV(
lasso_pipe,
param_grid=param_grid,
cv=5,
scoring='r2',
n_jobs=-1
)
grid.fit(X_train, y_train)
best_alpha = grid.best_params_['lasso__alpha']
final_score = grid.score(X_test, y_test)
print(f"Final Lasso R²: {final_score:.6f} (best α = {best_alpha})")
Baseline LinearRegression R²: 0.0006 Final Lasso R²: 0.000099 (best α = 0.001)
recipes.columns
Index(['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags',
'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
'n_ingredients', 'avg_rating', 'calories', 'total fat', 'protein',
'saturated fat', 'carbs', 'prop saturated fat', 'is_high_calorie',
'is_high_saturated_fat', 'is_vegan', 'is_quick'],
dtype='object')
dfs = recipes.drop(['name', 'id', 'contributor_id', 'submitted', 'tags', 'steps', 'description', 'ingredients', 'nutrition'], axis=1)
dfs
| | minutes | n_steps | n_ingredients | avg_rating | calories | total fat | protein | saturated fat | carbs | prop saturated fat | is_high_calorie | is_high_saturated_fat | is_vegan | is_quick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 10 | 9 | 4.0 | 138.4 | 10.0 | 3.0 | 19.0 | 6.0 | 0.247110 | False | True | False | False |
| 1 | 45 | 12 | 11 | 5.0 | 595.1 | 46.0 | 13.0 | 51.0 | 26.0 | 0.154260 | False | True | False | False |
| 2 | 40 | 6 | 9 | 5.0 | 194.8 | 20.0 | 22.0 | 36.0 | 3.0 | 0.332649 | False | True | False | False |
| 3 | 120 | 7 | 7 | 5.0 | 878.3 | 63.0 | 20.0 | 123.0 | 39.0 | 0.252078 | True | True | False | False |
| 4 | 90 | 17 | 13 | 5.0 | 267.0 | 30.0 | 29.0 | 48.0 | 2.0 | 0.323596 | False | True | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83777 | 60 | 7 | 22 | 5.0 | 415.2 | 26.0 | 44.0 | 21.0 | 15.0 | 0.091040 | False | False | False | False |
| 83778 | 5 | 1 | 13 | 5.0 | 14.8 | NaN | 1.0 | NaN | 1.0 | NaN | False | False | True | False |
| 83779 | 40 | 7 | 8 | 5.0 | 59.2 | 6.0 | 6.0 | 5.0 | NaN | 0.152027 | False | True | False | False |
| 83780 | 29 | 9 | 10 | 1.0 | 188.0 | 11.0 | 7.0 | 21.0 | 9.0 | 0.201064 | False | True | False | False |
| 83781 | 20 | 5 | 7 | 3.0 | 174.9 | 14.0 | 4.0 | 11.0 | 6.0 | 0.113208 | False | True | False | False |
83782 rows × 14 columns
import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = dfs.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
correlation_matrix
| | minutes | n_steps | n_ingredients | avg_rating | calories | total fat | protein | saturated fat | carbs | prop saturated fat | is_high_calorie | is_high_saturated_fat | is_vegan | is_quick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| minutes | 1.000000 | 0.007812 | -0.008204 | 0.001436 | 0.004530 | 0.003575 | 0.001973 | 0.004896 | 0.005577 | 0.009248 | 0.004441 | -0.002976 | -0.000441 | -0.001703 |
| n_steps | 0.007812 | 1.000000 | 0.430158 | 0.004603 | 0.144200 | 0.106248 | 0.115211 | 0.117735 | 0.094552 | 0.056664 | 0.159903 | 0.144604 | -0.056433 | 0.005396 |
| n_ingredients | -0.008204 | 0.430158 | 1.000000 | -0.003846 | 0.124034 | 0.076080 | 0.157879 | 0.045411 | 0.049537 | -0.116194 | 0.152787 | 0.056948 | -0.018309 | 0.057894 |
| avg_rating | 0.001436 | 0.004603 | -0.003846 | 1.000000 | -0.001384 | 0.010316 | -0.004778 | 0.007625 | -0.004522 | 0.029992 | -0.002107 | 0.011956 | -0.006958 | -0.016690 |
| calories | 0.004530 | 0.144200 | 0.124034 | -0.001384 | 1.000000 | 0.864394 | 0.607082 | 0.805517 | 0.817102 | 0.058666 | 0.541075 | 0.131677 | -0.054466 | 0.010786 |
| total fat | 0.003575 | 0.106248 | 0.076080 | 0.010316 | 0.864394 | 1.000000 | 0.509770 | 0.859081 | 0.457102 | 0.203307 | 0.488539 | 0.211282 | -0.054479 | -0.014077 |
| protein | 0.001973 | 0.115211 | 0.157879 | -0.004778 | 0.607082 | 0.509770 | 1.000000 | 0.478069 | 0.303767 | -0.018933 | 0.418210 | 0.098263 | -0.085727 | -0.052880 |
| saturated fat | 0.004896 | 0.117735 | 0.045411 | 0.007625 | 0.805517 | 0.859081 | 0.478069 | 1.000000 | 0.496506 | 0.370017 | 0.446202 | 0.286664 | -0.075544 | -0.016100 |
| carbs | 0.005577 | 0.094552 | 0.049537 | -0.004522 | 0.817102 | 0.457102 | 0.303767 | 0.496506 | 1.000000 | -0.081265 | 0.373980 | -0.037454 | -0.010414 | 0.042791 |
| prop saturated fat | 0.009248 | 0.056664 | -0.116194 | 0.029992 | 0.058666 | 0.203307 | -0.018933 | 0.370017 | -0.081265 | 1.000000 | 0.098194 | 0.707777 | -0.159458 | -0.046935 |
| is_high_calorie | 0.004441 | 0.159903 | 0.152787 | -0.002107 | 0.541075 | 0.488539 | 0.418210 | 0.446202 | 0.373980 | 0.098194 | 1.000000 | 0.164865 | -0.058562 | -0.020456 |
| is_high_saturated_fat | -0.002976 | 0.144604 | 0.056948 | 0.011956 | 0.131677 | 0.211282 | 0.098263 | 0.286664 | -0.037454 | 0.707777 | 0.164865 | 1.000000 | -0.200248 | -0.029860 |
| is_vegan | -0.000441 | -0.056433 | -0.018309 | -0.006958 | -0.054466 | -0.054479 | -0.085727 | -0.075544 | -0.010414 | -0.159458 | -0.058562 | -0.200248 | 1.000000 | -0.012531 |
| is_quick | -0.001703 | 0.005396 | 0.057894 | -0.016690 | 0.010786 | -0.014077 | -0.052880 | -0.016100 | 0.042791 | -0.046935 | -0.020456 | -0.029860 | -0.012531 | 1.000000 |
import pandas as pd
import numpy as np
import ast
from pathlib import Path
import re
import requests
import time
from scipy import stats
import plotly.figure_factory as ff
import plotly.express as px
from networkx.algorithms.traversal import dfs_successors
import sys
print(sys.executable)
import plotly.io as pio
pio.renderers.default='iframe'
pd.options.plotting.backend = 'plotly'
pd.set_option('display.max_columns', None)
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('interactions.csv')
# left merge two datasets
result = pd.merge(recipes, interactions, left_on='id', right_on='recipe_id', how='left')
# replace 0 rating to np.nan
"""
Justification for replacing 0 ratings with NaN:
1. In typical rating systems, 0 often indicates missing data rather than a true zero rating
2. Most rating scales don't include 0 (common scales are 1-5 or 1-10)
3. Including 0s would artificially lower the average ratings
4. NaN values will be automatically excluded from average calculations
"""
result['rating'] = result['rating'].replace(0, np.nan)
# get average rating for each recipe
avg_ratings = result.groupby('id')['rating'].mean().rename('avg_rating')
# add avg_rating to recipes dataset
recipes = recipes.merge(avg_ratings, left_on='id', right_index=True, how='left')
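Why replacing 0 with NaN before averaging matters can be seen on a tiny hypothetical pair of tables: with the 0 kept, recipe 1 would average (5 + 0) / 2 = 2.5; with NaN it averages 5, because `groupby(...).mean()` skips missing values. (The mini tables below are made-up illustrations, not rows from the real datasets.)

```python
import numpy as np
import pandas as pd

mini_recipes = pd.DataFrame({'id': [1, 2, 3]})
mini_inter = pd.DataFrame({'recipe_id': [1, 1, 2],
                           'rating': [5, 0, 4]})  # 0 stands in for "no rating given"

merged = mini_recipes.merge(mini_inter, left_on='id', right_on='recipe_id', how='left')
merged['rating'] = merged['rating'].replace(0, np.nan)
avg = merged.groupby('id')['rating'].mean().rename('avg_rating')
mini_recipes = mini_recipes.merge(avg, left_on='id', right_index=True, how='left')
print(mini_recipes)  # recipe 1 -> 5.0, recipe 2 -> 4.0, recipe 3 -> NaN (never rated)
```

Recipe 3 has no interactions at all, so the left merge leaves its `avg_rating` as NaN rather than dropping the recipe.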
# parse the nutrition column (stored as a list-like string) into actual Python lists
recipes['nutrition'] = recipes['nutrition'].apply(ast.literal_eval)
# values needed for the research: calories, total fat, protein, saturated fat, carbohydrates
recipes['calories'] = recipes['nutrition'].apply(lambda x: x[0] if isinstance(x, list) and x[0] > 0 else np.nan)
recipes['total fat'] = recipes['nutrition'].apply(lambda x: x[1] if isinstance(x, list) and x[1] > 0 else np.nan)
recipes['protein'] = recipes['nutrition'].apply(lambda x: x[4] if isinstance(x, list) and x[4] > 0 else np.nan)
recipes['saturated fat'] = recipes['nutrition'].apply(lambda x: x[5] if isinstance(x, list) and x[5] > 0 else np.nan)
recipes['carbs'] = recipes['nutrition'].apply(lambda x: x[6] if isinstance(x, list) and x[6] > 0 else np.nan)
# proportion of calories from saturated fat: PDV -> grams (daily value = 20 g) -> kcal (9 kcal/g) -> share of total calories
recipes['prop saturated fat'] = recipes['nutrition'].apply(lambda x: x[5] / 100 * 20 * 9 / x[0]
                                                           if isinstance(x, list) and x[5] > 0 else np.nan)
recipes['is_high_calorie'] = recipes['nutrition'].apply(lambda x: isinstance(x, list) and x[0] > 600)
recipes['is_high_saturated_fat'] = recipes['prop saturated fat'].apply(lambda x: x > 0.1)
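The `prop saturated fat` expression packs several unit conversions into one line. Spelled out on a hypothetical nutrition row (the food.com nutrition values are percent daily values, with a 20 g daily value assumed for saturated fat and 9 kcal per gram of fat):

```python
# Hypothetical row: [calories, total fat, sugar, sodium, protein, saturated fat, carbs]
nutrition = [200.0, 10.0, 5.0, 3.0, 4.0, 25.0, 7.0]

sat_fat_pdv = nutrition[5]               # 25 % of the daily value
sat_fat_grams = sat_fat_pdv / 100 * 20   # 25 % of 20 g  -> 5 g
sat_fat_cal = sat_fat_grams * 9          # 5 g * 9 kcal/g -> 45 kcal
prop = sat_fat_cal / nutrition[0]        # 45 / 200 -> 0.225
print(prop)  # 0.225
```

So a value above the 0.1 threshold means more than 10% of a recipe's calories come from saturated fat.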
# Univariate Analysis
fig1 = px.histogram(recipes, x='prop saturated fat', nbins=10,
                    title='Distribution of Proportion of Saturated Fat in Recipes')
fig1.update_layout(xaxis_title='Proportion of Saturated Fat in Recipe', yaxis_title='Count')
fig1.show()
fig2 = px.scatter(recipes, x='prop saturated fat', y='avg_rating',
                  title='Proportion of Saturated Fat vs. Average Recipe Rating',
                  labels={'prop saturated fat': 'Proportion of Saturated Fat', 'avg_rating': 'Average Rating'},
                  opacity=0.5)
fig2.show()
# Bivariate Analysis
# round average ratings to whole stars on the 1–5 scale, keeping NaN for unrated recipes
recipes['rating_rounded'] = recipes['avg_rating'].round().clip(1, 5)
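Rounding and then clipping keeps missing ratings as NaN while forcing every rounded value onto the 1–5 star scale; a toy Series (made up for illustration) makes the behavior concrete:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.6, 0.4, 5.0, np.nan])
rounded = s.round().clip(1, 5)  # NaN survives; 0.4 rounds to 0, then clips up to 1
print(rounded.tolist())  # [5.0, 1.0, 5.0, nan]
```

Casting to `int` instead would fail on the NaN entries, which is why the column stays float.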
df_plot1 = recipes.groupby(['is_high_calorie', 'rating_rounded']).size().reset_index(name='count')
df_plot2 = recipes.groupby(['is_high_saturated_fat', 'rating_rounded']).size().reset_index(name='count')
fig = px.bar(
    df_plot1,
    x='rating_rounded',
    y='count',
    color='is_high_calorie',
    barmode='group',
    title='High-Calorie vs Common Recipes',
    labels={'rating_rounded': 'Rating', 'count': 'Count'},
    category_orders={'rating_rounded': [1, 2, 3, 4, 5],
                     'is_high_calorie': [True, False]},
)
fig.show()
fig = px.bar(
    df_plot2,
    x='count',
    y='rating_rounded',
    color='is_high_saturated_fat',
    barmode='group',
    orientation='h',
    title='High Saturated Fat vs Common Recipes',
    labels={'rating_rounded': 'Rating', 'count': 'Count'},
    category_orders={'rating_rounded': [1, 2, 3, 4, 5],
                     'is_high_saturated_fat': [True, False]},
)
fig.show()
# Assessment of Missingness
def plot_permutation_test_histogram(observed_diffs, observed_stat, test_column):
    fig = px.histogram(x=observed_diffs,
                       nbins=50,
                       title=f'Permutation Test for {test_column.replace("_", " ").title()}',
                       labels={'x': 'Absolute Mean Difference', 'y': 'Probability'},
                       histnorm='percent',
                       )
    # Add vertical line for observed statistic
    fig.add_vline(x=observed_stat, line_width=2, line_dash="dash", line_color="red",
                  annotation_text=f'Observed: {observed_stat:.4f}',
                  annotation_position="top right")
    fig.update_layout(
        showlegend=True,
        xaxis_title='Absolute Mean Difference',
        yaxis_title='Probability',
        hovermode='x'
    )
    return fig
def test_missingness_dependency_plotly(df, test_column, missing_column='avg_rating'):
    df = df.copy()
    df['missing'] = df[missing_column].isna()
    # Split into two groups
    present = df[~df['missing']][test_column].dropna()
    missing = df[df['missing']][test_column].dropna()
    # Compare the distributions of the two groups
    fig = ff.create_distplot(
        hist_data=[present, missing],
        group_labels=['Rating present', 'Rating missing'],
        show_rug=False, show_hist=False
    )
    fig.update_layout(title=f'{test_column} by Missingness of Rating')
    fig.show()
    observed_stat = np.abs(present.mean() - missing.mean())
    # Permutation test
    perm_diffs = []
    n_permutations = 1000
    combined = df[test_column].dropna().values
    for _ in range(n_permutations):
        np.random.shuffle(combined)
        perm_present_mean = np.mean(combined[:len(present)])
        perm_missing_mean = np.mean(combined[len(present):])
        perm_diffs.append(np.abs(perm_present_mean - perm_missing_mean))
    perm_diffs = np.array(perm_diffs)
    # Calculate p-value
    p_value = (np.sum(perm_diffs >= observed_stat) + 1) / (len(perm_diffs) + 1)
    fig = plot_permutation_test_histogram(perm_diffs, observed_stat, test_column)
    return {
        'test_column': test_column,
        'present_mean': present.mean(),
        'missing_mean': missing.mean(),
        'observed_statistic': observed_stat,
        'p_value': p_value,
        'conclusion': 'Reject null' if p_value < 0.05 else 'Fail to reject null',
        'figure': fig
    }
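The logic inside `test_missingness_dependency_plotly` boils down to a standard permutation test. A stripped-down sketch on synthetic data (the group sizes and the 0.5 shift are arbitrary choices for illustration) shows how a genuine dependence yields a tiny p-value:

```python
import numpy as np

rng = np.random.default_rng(42)
# Two hypothetical groups: feature values where the rating is present vs. missing
present = rng.normal(loc=0.0, size=500)
missing = rng.normal(loc=0.5, size=500)  # shifted -> missingness depends on the value

observed = abs(present.mean() - missing.mean())
combined = np.concatenate([present, missing])

diffs = []
for _ in range(1000):
    rng.shuffle(combined)  # break any real group structure
    diffs.append(abs(combined[:500].mean() - combined[500:].mean()))
diffs = np.array(diffs)
p_value = (np.sum(diffs >= observed) + 1) / (len(diffs) + 1)
print(p_value)  # far below 0.05: reject the null of no dependence
```

Under the null, shuffled differences hover near zero, so an observed gap of roughly 0.5 is essentially never matched by chance.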
# Test for prop saturated fat
result_saturated_fat = test_missingness_dependency_plotly(recipes, 'prop saturated fat')
print("Saturated Fat Proportion Results:")
print(f"Present mean: {result_saturated_fat['present_mean']:.6f}")
print(f"Missing mean: {result_saturated_fat['missing_mean']:.6f}")
print(f"Observed statistic: {result_saturated_fat['observed_statistic']:.6f}")
print(f"P-value: {result_saturated_fat['p_value']:.6f}")
print(f"Conclusion: {result_saturated_fat['conclusion']}")
result_saturated_fat['figure'].show()
# Test for n_steps
result_steps = test_missingness_dependency_plotly(recipes, 'n_steps')
print("\nNumber of Steps Results:")
print(f"Present mean: {result_steps['present_mean']:.6f}")
print(f"Missing mean: {result_steps['missing_mean']:.6f}")
print(f"Observed statistic: {result_steps['observed_statistic']:.6f}")
print(f"P-value: {result_steps['p_value']:.6f}")
print(f"Conclusion: {result_steps['conclusion']}")
result_steps['figure'].show()
# Hypothesis Testing
def fat_rating_permutation_test(df, fat_col='is_high_saturated_fat', rating_col='avg_rating', n_permutations=1000):
    df_clean = df[[fat_col, rating_col]].dropna()
    high_fat = df_clean[df_clean[fat_col]][rating_col]
    low_fat = df_clean[~df_clean[fat_col]][rating_col]
    observed_stat = high_fat.mean() - low_fat.mean()
    combined = df_clean[rating_col].to_numpy(copy=True)
    n_high_fat = len(high_fat)
    perm_diffs = []
    for _ in range(n_permutations):
        np.random.shuffle(combined)
        perm_high_fat_mean = combined[:n_high_fat].mean()
        perm_low_fat_mean = combined[n_high_fat:].mean()
        perm_diffs.append(perm_high_fat_mean - perm_low_fat_mean)
    perm_diffs = np.array(perm_diffs)
    # One-sided p-value: how often a shuffled difference is at least as large as the observed one
    p_value = (np.sum(perm_diffs >= observed_stat) + 1) / (len(perm_diffs) + 1)
    fig = px.histogram(x=perm_diffs,
                       nbins=50,
                       title='Distribution of Mean Difference in Rating (High-Fat - Low-Fat)',
                       labels={'x': 'Mean Difference (High-Fat - Low-Fat)', 'y': 'Frequency'},
                       opacity=0.7)
    fig.add_vline(x=observed_stat, line_width=2, line_dash="dash", line_color="red",
                  annotation_text=f'Observed: {observed_stat:.4f}',
                  annotation_position="top right")
    fig.update_layout(
        showlegend=False,
        xaxis_title='Mean Difference (High-Fat - Low-Fat)',
        yaxis_title='Frequency',
        hovermode='x',
        annotations=[dict(text=f'p_value: {p_value:.4f}', x=0.05, y=0.95,
                          xref='paper', yref='paper', showarrow=False)]
    )
    return {
        'high_fat_mean': high_fat.mean(),
        'low_fat_mean': low_fat.mean(),
        'observed_statistic': observed_stat,
        'p_value': p_value,
        'n_high_fat': len(high_fat),
        'n_low_fat': len(low_fat),
        'conclusion': 'Reject null' if p_value < 0.05 else 'Fail to reject null',
        'figure': fig
    }
results = fat_rating_permutation_test(recipes)
print(f"High Saturated Fat Recipes: {results['high_fat_mean']:.4f}")
print(f"Common Recipes: {results['low_fat_mean']:.4f}")
print(f"Observed statistic: {results['observed_statistic']:.4f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Conclusion: {results['conclusion']}")
print(f"High saturated fat recipes: {results['n_high_fat']}")
print(f"Common recipes: {results['n_low_fat']}")
results['figure'].show()
d:\miniforge3\envs\dsc80\python.exe
Saturated Fat Proportion Results:
Present mean: 0.164344
Missing mean: 0.170520
Observed statistic: 0.006176
P-value: 0.006993
Conclusion: Reject null

Number of Steps Results:
Present mean: 10.058948
Missing mean: 11.551936
Observed statistic: 1.492987
P-value: 0.000999
Conclusion: Reject null

High Saturated Fat Recipes: 4.6317
Common Recipes: 4.6161
Observed statistic: 0.0156
P-value: 0.0010
Conclusion: Reject null
High saturated fat recipes: 48174
Common recipes: 32999
Step 6: Baseline Model¶
Step 7: Final Model¶